---
title: Enforce custom rate limits
description: Learn how to set up granular custom rate limits for your TensorZero Gateway.
---

The TensorZero Gateway supports granular custom rate limits to help you control usage and costs.

Rate limit rules have three key components:

- **Resources:** Define what you're limiting (like model inferences or tokens) and the time window (per second, minute, hour, day, week, or month). For example, "1000 model inferences per day" or "500,000 tokens per hour".
- **Priority:** Control which rules take precedence when multiple rules could apply to the same request. Higher priority numbers override lower ones.
- **Scope:** Determine which requests the rule applies to. You can set global limits for all requests, or targeted limits using custom tags like user IDs.

## Learn rate limiting concepts

Let's start with a brief tutorial on the concepts behind custom rate limits in TensorZero.

You can define custom rate limiting _rules_ in your TensorZero configuration using `[[rate_limiting.rules]]`.
Your configuration can have multiple rules.

Rate limit state is stored in Postgres, so restarting the gateway preserves existing limits and multiple gateway instances automatically share the same limits.

<Warning>

Tracking begins when a rate limit rule is first applied to a request.
Requests made before a rule was configured do not count towards its limit.
Modifying a rate limit rule resets its usage.

</Warning>

### Resources

Each rate limiting rule can have one or more _resource limits_.
A resource limit is defined using the `RESOURCE_per_WINDOW` syntax.
For example:

```toml title="tensorzero.toml"
[[rate_limiting.rules]]
# ...
model_inferences_per_day = 1_000
tokens_per_second = 1_000_000
# ...
```

Time windows are sequential and non-overlapping (i.e. they are not sliding windows).
Each window is aligned to when the rate limit bucket is first initialized.
For example, if a rule with a `RESOURCE_per_minute` limit is first used at 10:30:15, it'll be refilled at 10:31:15, 10:32:15, and so on.
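
The window alignment described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the gateway's implementation; the `current_window` helper and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta


def current_window(first_used: datetime, now: datetime, window: timedelta):
    """Return (start, end) of the fixed window containing `now`.

    Windows are anchored at `first_used`, the moment the bucket was initialized.
    """
    elapsed_windows = (now - first_used) // window  # whole windows elapsed so far
    start = first_used + elapsed_windows * window
    return start, start + window


first_used = datetime(2025, 1, 1, 10, 30, 15)
start, end = current_window(
    first_used, datetime(2025, 1, 1, 10, 32, 40), timedelta(minutes=1)
)
print(start.time(), end.time())  # 10:32:15 10:33:15
```

A request at 10:32:40 falls in the window that started at 10:32:15, so the next refill happens at 10:33:15.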

<Note>

You must specify `max_tokens` for a request if a token limit applies to it.
The gateway makes a reasonably conservative estimate of token usage and later records the actual usage.

</Note>
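
To see why `max_tokens` is needed, here is a minimal sketch of an estimate-then-reconcile flow. It assumes the estimate is an upper bound built from the prompt size plus `max_tokens`; the `TokenBudget` class and the numbers are hypothetical, and the gateway's actual estimation logic may differ.

```python
class TokenBudget:
    """Hypothetical in-memory budget; the real gateway keeps this state in Postgres."""

    def __init__(self, limit: int):
        self.remaining = limit

    def reserve(self, estimate: int) -> bool:
        """Hold `estimate` tokens up front; reject if the budget can't cover it."""
        if estimate > self.remaining:
            return False
        self.remaining -= estimate
        return True

    def reconcile(self, estimate: int, actual: int) -> None:
        """Replace the earlier estimate with the actual usage once it is known."""
        self.remaining += estimate - actual


budget = TokenBudget(limit=2_000)
estimate = 50 + 1_000  # assumption: estimated prompt tokens + max_tokens
assert budget.reserve(estimate)
budget.reconcile(estimate, actual=320)
print(budget.remaining)  # 1680
```

Without an upper bound such as `max_tokens`, there would be nothing to reserve against before the response arrives.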

### Scope

Each rate limiting rule can optionally have a _scope_.
The scope restricts the rule to certain requests only.
If you don't specify a scope, the rule will apply to all requests.

You can scope rate limiting rules by tags or by API key public ID.

#### By tags

You can scope rate limits using user-defined `tags`.
You can limit the scope to a specific value, to each individual value (`tensorzero::each`), or to every value collectively (`tensorzero::total`).

For example, the following rule would only apply to inference requests with the tag `user_id` set to `intern`:

```toml title="tensorzero.toml"
[[rate_limiting.rules]]
# ...
scope = [
    { tag_key = "user_id", tag_value = "intern" }
]
# ...
```

If a scope has multiple entries, all of them must be met for the rule to apply.

For example, the following rule would only apply to inference requests with the tag `user_id` set to `intern` _and_ the tag `env` set to `production`:

```toml title="tensorzero.toml"
[[rate_limiting.rules]]
# ...
scope = [
    { tag_key = "user_id", tag_value = "intern" },
    { tag_key = "env", tag_value = "production" }
]
# ...
```

Entries based on `tags` support two special strings for `tag_value`:

- `tensorzero::each`: The rule independently applies to every `tag_key` value.
- `tensorzero::total`: The rule applies to all values of the tag collectively (usage is aggregated into a single shared limit).

For example, the following rule would apply to each value of the `user_id` tag individually (i.e. each user gets their own limit):

```toml title="tensorzero.toml"
[[rate_limiting.rules]]
# ...
scope = [
    { tag_key = "user_id", tag_value = "tensorzero::each" },
]
# ...
```

Conversely, the following rule would apply to all users collectively:

```toml title="tensorzero.toml"
[[rate_limiting.rules]]
# ...
scope = [
    { tag_key = "user_id", tag_value = "tensorzero::total" },
]
# ...
```

<Warning>

The rule above won't apply to requests that do not specify a `user_id` tag.

</Warning>
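
The tag scoping semantics above can be summarized with a small Python sketch. It is a simplified model under stated assumptions, not the gateway's code: `bucket_key` is a hypothetical helper that returns the bucket a request would count against, or `None` when the rule doesn't apply.

```python
EACH = "tensorzero::each"
TOTAL = "tensorzero::total"


def bucket_key(scope, tags):
    """Return the bucket a request counts against, or None if the rule doesn't apply.

    `scope` is a list of {"tag_key": ..., "tag_value": ...} entries;
    `tags` are the tags attached to the request.
    """
    key = []
    for entry in scope:
        name, expected = entry["tag_key"], entry["tag_value"]
        if name not in tags:
            return None  # every entry must be met, and a missing tag never matches
        if expected == EACH:
            key.append((name, tags[name]))  # one bucket per distinct value
        elif expected == TOTAL:
            key.append((name, TOTAL))  # one shared bucket for all values
        elif tags[name] == expected:
            key.append((name, expected))
        else:
            return None
    return tuple(key)


each = [{"tag_key": "user_id", "tag_value": EACH}]
total = [{"tag_key": "user_id", "tag_value": TOTAL}]
print(bucket_key(each, {"user_id": "alice"}))     # (('user_id', 'alice'),)
print(bucket_key(total, {"user_id": "alice"}))    # (('user_id', 'tensorzero::total'),)
print(bucket_key(total, {"env": "production"}))   # None
```

The last call shows the case from the warning: a request without a `user_id` tag is not counted by either rule.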

#### By API keys

You can scope rate limits using API keys when authentication is enabled.
This allows you to enforce different rate limits for different API keys, which is useful for implementing tiered access or preventing individual keys from consuming too many resources.

You can limit the scope to each individual API key (`tensorzero::each`) or to a specific API key by providing its 12-character public ID.

For example, the following rule would apply to each API key individually (i.e. each API key gets its own limit):

```toml title="tensorzero.toml"
[[rate_limiting.rules]]
# ...
scope = [
    { api_key_public_id = "tensorzero::each" },
]
# ...
```

You can also target a specific API key by providing its 12-character public ID:

```toml title="tensorzero.toml"
[[rate_limiting.rules]]
# ...
scope = [
    { api_key_public_id = "xxxxxxxxxxxx" },
]
# ...
```

<Tip>

TensorZero API keys have the following format:

`sk-t0-xxxxxxxxxxxx-yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy`

The `xxxxxxxxxxxx` portion is the 12-character public ID that you can use in rate limiting rules.
The remaining portion of the key is secret and should be kept secure.

</Tip>

Unlike tag scopes, API key public ID scopes do not support `tensorzero::total`.
Only `tensorzero::each` and concrete 12-character public IDs are supported.
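
If you need the public ID programmatically (for example, to generate rules), a sketch like the following extracts it from a key. It assumes the key format shown above and that the public ID itself contains no hyphens; `public_id` is a hypothetical helper, not part of the TensorZero SDK.

```python
def public_id(api_key: str) -> str:
    """Extract the 12-character public ID from a key shaped like sk-t0-<public>-<secret>."""
    parts = api_key.split("-", 3)  # at most four parts: "sk", "t0", public ID, secret
    if len(parts) != 4 or parts[:2] != ["sk", "t0"] or len(parts[2]) != 12:
        raise ValueError("unexpected API key format")
    return parts[2]


key = "sk-t0-" + "x" * 12 + "-" + "y" * 48  # placeholder key, not a real secret
print(public_id(key))  # xxxxxxxxxxxx
```

Only the public ID should end up in your configuration or logs; the secret portion should never be written to either.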

<Warning>

Rules with `api_key_public_id` scope won't apply to unauthenticated requests.
Learn how to [set up auth for TensorZero](/operations/set-up-auth-for-tensorzero).

</Warning>

### Priority

Each rate limiting rule must have a _priority_ (e.g. `priority = 1`), unless it sets `always = true`.
The gateway iterates through the rules in order of priority, starting with the highest, until it finds a rule that matches the request.
It then enforces all matching rules with that priority number and disregards any rules with lower priority.

For example, the configuration below would enforce the first rule for requests with `user_id = "intern"` and the second rule for all other `user_id` values:

```toml title="tensorzero.toml"
[[rate_limiting.rules]]
# ...
scope = [
    { tag_key = "user_id", tag_value = "intern" },
]
priority = 1
# ...

[[rate_limiting.rules]]
# ...
scope = [
    { tag_key = "user_id", tag_value = "tensorzero::each" },
]
priority = 0
# ...
```

Alternatively, you can set `always = true` to enforce the rule regardless of other rules; rules with `always = true` do not affect the priority calculation above.
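
The selection logic can be sketched as follows. This is a simplified model of the behavior described in this section, assuming the input already contains only the rules whose scope matches the request; `rules_to_enforce` and the rule dictionaries are hypothetical.

```python
def rules_to_enforce(matching_rules):
    """Pick the rules to enforce among those whose scope matches the request.

    Each rule is a dict with a "name" plus either "priority" (int) or "always": True.
    """
    always = [r for r in matching_rules if r.get("always")]
    prioritized = [r for r in matching_rules if not r.get("always")]
    if not prioritized:
        return always
    top = max(r["priority"] for r in prioritized)
    # Enforce every rule at the highest matching priority; lower priorities are disregarded.
    return always + [r for r in prioritized if r["priority"] == top]


intern_rule = {"name": "intern", "priority": 1}
each_user_rule = {"name": "each user", "priority": 0}
global_rule = {"name": "global", "always": True}

# A request tagged user_id = "intern" matches all three rules.
print([r["name"] for r in rules_to_enforce([intern_rule, each_user_rule, global_rule])])
# ['global', 'intern']

# A request tagged with any other user_id only matches the last two.
print([r["name"] for r in rules_to_enforce([each_user_rule, global_rule])])
# ['global', 'each user']
```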

## Set up rate limits

Let's set up rate limits for an application to restrict usage depending on a user-defined tag for user IDs.

<Tip>

You can find a [complete runnable example](https://github.com/tensorzero/tensorzero/tree/main/examples/docs/guides/operations/enforce-custom-rate-limits) of this guide on GitHub.

</Tip>

<Steps>

<Step title="Set up Postgres">

You must set up Postgres to use TensorZero's rate limiting features.

See the [Deploy Postgres](/deployment/postgres) guide for instructions.

</Step>

<Step title="Configure rate limiting rules">

Add to your TensorZero configuration:

```toml title="config/tensorzero.toml"
# [A] Collectively, all users can make a maximum of 1k model inferences per hour and use a maximum of 10M tokens per day
[[rate_limiting.rules]]
always = true
model_inferences_per_hour = 1_000
tokens_per_day = 10_000_000
scope = [
    { tag_key = "user_id", tag_value = "tensorzero::total" }
]

# [B] Each individual user can make a maximum of 1 model inference per minute
[[rate_limiting.rules]]
priority = 0
model_inferences_per_minute = 1
scope = [
    { tag_key = "user_id", tag_value = "tensorzero::each" }
]

# [C] But override the individual limit for the CEO
[[rate_limiting.rules]]
priority = 1
model_inferences_per_minute = 5
scope = [
    { tag_key = "user_id", tag_value = "ceo" }
]

# [D] The entire system (i.e. without restricting the scope) can use a maximum of 10M tokens per hour
[[rate_limiting.rules]]
always = true
tokens_per_hour = 10_000_000
```

Make sure to reload your gateway.

</Step>

<Step title="Make inference requests">

If we make two consecutive inference requests with `user_id = "intern"`, the second one should fail because of rule `[B]`.
However, if we make two consecutive inference requests with `user_id = "ceo"`, both should succeed because rule `[C]` will override rule `[B]`.

<Tabs>

<Tab title="Python (TensorZero SDK)">

```python
from tensorzero import TensorZeroGateway

t0 = TensorZeroGateway.build_http(gateway_url="http://localhost:3000")


def call_llm(user_id):
    try:
        return t0.inference(
            model_name="openai::gpt-4.1-mini",
            input={
                "messages": [
                    {
                        "role": "user",
                        "content": "Tell me a fun fact.",
                    }
                ]
            },
            # We have rate limits on tokens, so we must be conservative and provide `max_tokens`
            params={
                "chat_completion": {
                    "max_tokens": 1000,
                }
            },
            tags={
                "user_id": user_id,
            },
        )
    except Exception as e:
        print(f"Error calling LLM: {e}")


# The second should fail
print(call_llm("intern"))
print(call_llm("intern"))  # should return None

# Both should work
print(call_llm("ceo"))
print(call_llm("ceo"))
```

</Tab>

<Tab title="Python (OpenAI SDK)">

```python
from openai import OpenAI

oai = OpenAI(base_url="http://localhost:3000/openai/v1")


def call_llm(user_id):
    try:
        return oai.chat.completions.create(
            model="tensorzero::model_name::openai::gpt-4.1-mini",
            messages=[
                {
                    "role": "user",
                    "content": "Tell me a fun fact.",
                }
            ],
            max_tokens=1000,
            extra_body={"tensorzero::tags": {"user_id": user_id}},
        )
    except Exception as e:
        print(f"Error calling LLM: {e}")

# The second should fail
print(call_llm("intern"))
print(call_llm("intern"))  # should return None

# Both should work
print(call_llm("ceo"))
print(call_llm("ceo"))
```

</Tab>

</Tabs>

</Step>

</Steps>

## Advanced

### Customize capacity and refill rate

By default, rate limits use a simple bucket model where the entire capacity refills at the start of each time window.
For example, `tokens_per_minute = 100_000` allows 100,000 tokens every minute, with the full allowance resetting at the start of each one-minute window.

However, you can customize this behavior using the `capacity` and `refill_rate` parameters to create a token bucket that refills continuously:

```toml
[[rate_limiting.rules]]
# ...
tokens_per_minute = { capacity = 100_000, refill_rate = 10_000 }
# ...
```

In this example, the `capacity` parameter sets the maximum number of tokens that can be stored in the bucket, while the `refill_rate` determines how many tokens are added to the bucket per time window (10,000 per minute).
This creates smoother rate limiting behavior: instead of getting your full allowance at the start of each minute, you get 10,000 tokens added every minute, up to a maximum of 100,000 tokens stored at any time.
To get the most out of this behavior, you'll typically want to use a low time granularity with a `capacity` much larger than the `refill_rate`.

This approach is particularly useful for burst protection (users can't consume their entire daily allowance in the first few seconds), smoother traffic distribution (requests are naturally spread out over time rather than clustering at window boundaries), and a better user experience (users get a steady trickle of quota rather than having to wait for the next time window).
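
To make the interaction between `capacity` and `refill_rate` concrete, here is a minimal Python sketch of such a bucket. It assumes the bucket starts full and that one refill is credited per whole window elapsed; the `Bucket` class is hypothetical and the gateway's exact refill schedule may differ.

```python
class Bucket:
    """Hypothetical bucket with a capacity and a per-window refill rate."""

    def __init__(self, capacity: int, refill_rate: int, window_seconds: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.window_seconds = window_seconds
        self.available = capacity  # assumption: the bucket starts full
        self.last_refill = now

    def try_consume(self, amount: int, now: float) -> bool:
        # Credit one refill per whole window elapsed, never exceeding capacity.
        windows = int((now - self.last_refill) // self.window_seconds)
        if windows > 0:
            self.available = min(self.capacity, self.available + windows * self.refill_rate)
            self.last_refill += windows * self.window_seconds
        if amount > self.available:
            return False
        self.available -= amount
        return True


# Mirrors tokens_per_minute = { capacity = 100_000, refill_rate = 10_000 }
bucket = Bucket(capacity=100_000, refill_rate=10_000, window_seconds=60)
print(bucket.try_consume(100_000, now=0))  # True: a burst can drain the full capacity
print(bucket.try_consume(10_000, now=30))  # False: nothing has been refilled yet
print(bucket.try_consume(10_000, now=60))  # True: one window later, 10,000 tokens are back
```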
